System and method for symbol-space based compression of patterns

ABSTRACT

A method and system for symbol-space based pattern compression is provided. The method includes identifying a plurality of combinations of symbols in an input sequence, each identified combination of symbols appearing in the input sequence above a predefined threshold, the input sequence having a first length; generating an output sequence having a second length by replacing each identified combination of symbols with a unique symbol, wherein each unique symbol is not a previously used symbol, wherein the second length is shorter than the first length; and storing the output sequence as a data layer.

CROSS REFERENCE TO RELATED APPLICATIONS

This application claims the benefit of U.S. patent application Ser. No. 14/573,652 filed on Dec. 17, 2014, now allowed. The Ser. No. 14/573,652 application is a continuation of U.S. patent application Ser. No. 13/874,159 filed on Apr. 30, 2013, now U.S. Pat. No. 8,922,414, which claims the benefit of U.S. Provisional Application No. 61/763,554 filed on Feb. 12, 2013. All of the above-referenced applications are hereby incorporated by reference.

TECHNICAL FIELD

The disclosure generally relates to pattern recognition and big-data, and more particularly to systems and methods that make use of pattern recognition techniques and big-data storage and analytics.

BACKGROUND

Recognizing patterns and properly assembling them for storage, preferably in a compact way, is continuously being attempted. However, unless otherwise specified, it cannot be assumed that all patterns are evenly distributed along the data. Because some patterns can be more prominent than others, they are likely to have a larger number of occurrences, while other patterns may be very rare. In addition, some patterns may be correlated with each other, and together form pattern-combinations which may also be very popular. This poses a problem for applications of pattern recognition systems. For example, to retrieve a similarity measurement between two content-segments, it is not enough to consider the number of corresponding patterns; the probability of occurrence of each pattern should be considered as well. In addition, correlation between patterns should also be considered. For example, if two patterns always appear together, in essence they do not contain more information than a single pattern.

Such an effect, in turn, is detrimental to the scalability and the accuracy of a pattern-recognition system. That is, if the handling of different patterns is spread between multiple machines of the pattern-recognition system, then most machines dealing with “less-popular” patterns will remain inactive, whereas a few machines, processing “popular” patterns, will be overburdened with accesses. It is also impossible to distribute the handling of patterns according to their a priori probability because of correlations between patterns, of which no assumptions can be made. Furthermore, in general, to scale up a pattern-recognition system it would be preferable to avoid duplication of the pattern-space and the need to hold a copy of the patterns in each machine.

Reduction of multiple symbols, such as a pattern, to a smaller number of manageable symbols that are easily recognizable is performed manually in certain cases. Consider, for example, a sequence of notes that are combined into a chord. A chord is a combination of two or more notes that are played, or otherwise heard as if being played, simultaneously. Chords are repetitive in nature and hence, in order to reduce the number of notes provided to a performer, the sequence of notes is reduced to a chord symbol, which represents the plurality of notes. For example, the chord marked as Am means that the performer is to play the root note A, the minor third C, and the perfect fifth E, so that they appear to be played simultaneously. A person can easily translate the symbol of a chord into the specific notes it represents. Similarly, the creation of the mapping between two sets of symbols is performed manually based on specific rules, which may be added, deleted or modified as necessary.

It would be advantageous to provide an efficient solution for pattern recognition that overcomes the deficiencies of the prior art, particularly the requirement for human manual intervention in the recognition process.

SUMMARY

Certain embodiments disclosed herein include a method for symbol-space based pattern compression. The method comprises identifying a plurality of combinations of symbols in an input sequence, each identified combination of symbols appearing in the input sequence above a predefined threshold, the input sequence having a first length; generating an output sequence having a second length by replacing each identified combination of symbols with a unique symbol, wherein each unique symbol is not a previously used symbol, wherein the second length is shorter than the first length; and storing the output sequence as a data layer.

Certain embodiments disclosed herein also include a system for symbol-space based pattern compression. The system comprises a processing unit; and a memory, the memory containing instructions that, when executed by the processing unit, configure the system to: identify a plurality of combinations of symbols in an input sequence, each identified combination of symbols appearing in the input sequence above a predefined threshold, the input sequence having a first length; generate an output sequence having a second length by replacing each identified combination of symbols with a unique symbol, wherein each unique symbol is not a previously used symbol, wherein the second length is shorter than the first length; and store the output sequence as a data layer.

BRIEF DESCRIPTION OF THE DRAWINGS

The subject matter disclosed herein is particularly pointed out and distinctly claimed in the claims at the conclusion of the specification. The foregoing and other objects, features and advantages of the disclosed embodiments will be apparent from the following detailed description taken in conjunction with the accompanying drawings.

FIG. 1 is an original sequence of symbols having a first symbol space used as an input for processing according to one embodiment.

FIG. 2 is a first level table utilized for mapping an input sequence and for the determination of replacement symbols for sequences of symbols according to one embodiment.

FIG. 3 is a sequence of symbols representing a reduced number of symbols created using a second symbol space larger than the first symbol space according to one embodiment.

FIG. 4 is a second level table utilized for mapping an input sequence and for the determination of replacement symbols for sequences of symbols according to one embodiment.

FIG. 5 is a sequence representing a reduced number of symbols created using a third symbol space larger than the second symbol space according to one embodiment.

FIG. 6 is a third level table utilized for mapping of the input sequence and for the determination of replacement symbols for sequences of symbols according to another embodiment.

FIG. 7 is a sequence representing a reduced number of symbols created using a fourth symbol space larger than the third symbol space according to one embodiment.

FIGS. 8A through 8D are diagrams of the image symbols line, square, circle and triangle, respectively, used according to one embodiment.

FIGS. 9A and 9B are higher level image symbols of a “house” and a “chair”, respectively, created from basic symbols according to one embodiment.

FIGS. 10A through 10D are basic symbols of a line, a square, a circle, and a triangle, respectively, each having corresponding connection ports.

FIGS. 11A through 11C are higher level image symbols of a “man”, a “woman” and a “dog”, respectively, created from basic symbols according to one embodiment.

FIG. 12 is a flowchart depicting the creation of data layers responsive of an input sequence of input symbols for achieving symbol-space based compression of patterns according to one embodiment.

FIG. 13 is a system for creation of data layers responsive of an input sequence of input symbols for achieving symbol-space based compression of patterns according to one embodiment.

DETAILED DESCRIPTION

It is important to note that the embodiments disclosed herein are only examples of the many advantageous uses of the innovative teachings herein. In general, statements made in the specification of the present application do not necessarily limit any of the various claimed embodiments. Moreover, some statements may apply to some inventive features but not to others. In general, unless otherwise indicated, singular elements may be in plural and vice versa with no loss of generality. In the drawings, like numerals refer to like parts through several views.

The various techniques disclosed herein allow mapping natural signals and/or features extracted from natural signals to compressed representations in high-dimensional space with properties of repeatability and invariance. Specifically, for a given input space, a plurality of data layers (Cortex) are created respective of the input data, each represented by more symbols, i.e., at least one more symbol than the immediately previous list of symbols, but with a shorter overall length, i.e., a length that is shorter than the immediately preceding length of the symbols' sequence.

Accordingly, information is represented in a more compact way and more easily recognized over a symbol-space. The input data may be of an image, video, text, voice and other types of data that can be mapped in a plurality of data layers. In one embodiment, the disclosed techniques can be described as an ability to determine what a “table” is by comparing it to an “ideal table” of a higher data layer. Specifically, a pattern-space is generated that is big enough to be spread across multiple machines (or processors) of a pattern-recognition system, each machine handling a different range in the pattern-space. The pattern-space includes one or more patterns.

According to one embodiment, input “patterns” are received from a mechanism (or system) designed for finding “patterns” in content-segments. The input patterns are loosely defined as arbitrary representations of some features in a content-segment. However, it should be noted that the received “Patterns” are also associated with any information as to what these patterns represent and about the locality of these patterns. A collection of such patterns is referred to herein as a “descriptor”. A content segment may be represented by one or more “descriptors”. For example, if the content-segment is a 2D image, Patterns may indicate that specific shapes or colors were detected in that image.

According to the disclosed embodiments, the pattern-space of the received input patterns is transformed into a pattern-space that is larger in size, but more balanced, de-correlated, repeatable and invariant, as described in greater detail herein. Specifically, in each descriptor, the original input patterns are replaced with new patterns, which represent combinations of patterns from the original pattern-space. Accordingly, the disclosed techniques first make the pattern-space larger, thus improving scalability; secondly, they flatten and de-correlate the pattern-space for better accuracy; and thirdly, they improve invariance and repeatability by including large-scale information on the probability of patterns in content-segments from a single domain.

Following is a general description of the operation of the disclosed techniques (realized by the system and methods discussed below) according to one embodiment. A Cortex is a function F: S₀→S_(n), where for any k {k=0, 1, . . . , n}, S_(k) is a pattern-space, which includes one or more patterns. The initial pattern-space S₀ is defined by the input patterns; each following symbol-space, which is the next layer of a Cortex, is defined and created by an “iteration function” F_(k): S_(k)→S_(k+1), which converts any set of patterns in S_(k) to a set of patterns in S_(k+1) according to one or more predefined conversion rules. The conversion rules in any “iteration function” are generated according to the distribution of patterns in a large-scale collection of patterns, such as content-segments, from a certain domain. For example, if a domain of interest is “2D natural photos”, some large number N of descriptors in S_(k) are generated and denoted S₁ . . . S_(N). The content-segments in these examples include 2D images of nature.

According to one embodiment, an iteration for creation of a data layer F_(k) of a Cortex is defined according to the distribution of patterns in those N descriptors and has several steps. First, S_(k+1) is initialized as a copy of S_(k). Then, S₁ . . . S_(N) are used to build a collection of common combinations of patterns in S_(k), denoted {c_(i)⊂S_(k)}, where ⊂ denotes the subset relation. Then, for each combination c_(i) {i=1, 2 . . . N} whose probability in S₁ . . . S_(N) is larger than a first threshold T₁, a new label is added to S_(k+1), thus increasing the space by one. For each “original label” in S_(k) having a probability in {S₁ . . . S_(N)} that is larger than a second threshold T₂, the respective “original label” is removed from S_(k+1). Finally, each “original label” in S_(k) for which the number of combinations c_(i) that include it is larger than a third threshold T₃ is removed from S_(k+1). Typically the thresholds T₁, T₂ and T₃ are numerical values representing a certain probability, examples of which are discussed herein.
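By way of a non-limiting illustration, the iteration steps above may be sketched as follows. This is a minimal sketch under stated assumptions, not the disclosed implementation: each descriptor is modeled as a set of labels, probabilities are estimated as the fraction of descriptors containing a label or combination, only combinations of a fixed size are considered, and T₃ is treated as a count. The function name `next_pattern_space` and the parameter `combo_size` are hypothetical.

```python
from collections import Counter
from itertools import combinations


def next_pattern_space(space, descriptors, t1, t2, t3, combo_size=2):
    """Sketch of one iteration S_k -> S_(k+1) over sets of labels.

    Assumptions (not from the source): descriptors are sets of labels,
    probability = fraction of descriptors, t3 is a plain count.
    """
    n = len(descriptors)
    # How many descriptors contain each original label.
    label_count = Counter(label for d in descriptors for label in set(d))
    # How many descriptors contain each combination of labels.
    combo_count = Counter(
        frozenset(c)
        for d in descriptors
        for c in combinations(sorted(set(d)), combo_size)
    )
    # Common combinations: probability larger than the first threshold.
    common = [c for c, cnt in combo_count.items() if cnt / n > t1]

    new_space = set(space)        # S_(k+1) is initialized as a copy of S_k
    new_space.update(common)      # one new label per common combination
    for label in space:
        too_popular = label_count[label] / n > t2
        memberships = sum(1 for c in common if label in c)
        if too_popular or memberships > t3:
            new_space.discard(label)
    return new_space


descriptors = [{"a", "b"}, {"a", "b", "c"}, {"a", "b"}, {"c", "d"}]
layer = next_pattern_space({"a", "b", "c", "d"}, descriptors, t1=0.5, t2=0.7, t3=1)
```

In this toy input the combination {a, b} is common, so it receives a new label, while the very popular original labels "a" and "b" are removed and the rarer labels "c" and "d" are kept.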

At the completion of this process a consistent definition of the data layer F_(k) is achieved, where each pattern in S_(k+1) is either a pattern in S_(k) or strongly defined as a collection of patterns in S_(k), so that testing for the collection indicates whether the new pattern should be included. The result is that S_(k+1) is a larger signature-space, where patterns that are very common have been removed and/or replaced with combinations of other patterns.

The threshold parameters T₁, T₂ and T₃ should be carefully tuned, so as not to lose valuable patterns, and at the same time to avoid inclusion of “noisy” patterns. The hierarchical process can be repeated any desired number of times, with any choice of thresholds, for as long as the length decreases and the number of unique symbols used increases. Each iteration creates a data layer which is a more compact representation of the immediately preceding data layer. That is, a plurality of symbols of the respective input patterns are mapped to a single symbol.

In one embodiment, the input patterns or data are unique to a domain, for example, text in English, human faces, classical music, and so on. In another embodiment, any combination of data from domains can be used. According to an embodiment, symbols are joined if they have a high correlation. However, symbols can also be combined even if they are not correlated by showing a common co-occurrence, i.e., a tendency to appear together without being actually correlated.

It should be appreciated that there are at least two important outcomes to the process described herein. First, the process is scalable, that is, after performing the process described herein, the pattern-space is large and balanced, thus the pattern-space can be spread evenly between multiple machines, with each machine handling a sub-range of the pattern-space. Therefore, a “route” strategy can be used for querying rather than query duplication.
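By way of a non-limiting illustration, a “route” strategy over sub-ranges of the pattern-space may be sketched as follows. The source does not specify how a pattern is positioned in the pattern-space; the use of a SHA-256 digest to obtain a position, the 32-bit space size, and the function name `machine_for` are assumptions made only for this sketch.

```python
import hashlib


def machine_for(pattern, num_machines, space_bits=32):
    """Route a pattern to the machine that owns its sub-range.

    Assumption: the position of a pattern in the pattern-space is taken
    from a hash of its textual form; the space is split into equal ranges.
    """
    digest = hashlib.sha256(pattern.encode("utf-8")).digest()
    position = int.from_bytes(digest[: space_bits // 8], "big")
    range_size = (1 << space_bits) // num_machines
    # The last machine absorbs the remainder when the split is uneven.
    return min(position // range_size, num_machines - 1)


target = machine_for("C(6)<90°>A(1)", num_machines=4)
```

A query for a given pattern is then sent only to the machine returned by `machine_for`, rather than being duplicated to every machine.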

Another important outcome of the disclosed process is its accuracy. That is, in the data layer iteration-building process, a set of “real-world” data S₁ . . . S_(N) is used as the basis for the necessary statistics. This means that by applying the teachings disclosed herein, more weight is given to patterns that are less popular (and therefore more significant) in a random sample. Thus, assuming that the input content-segments are from the same domain, the generated data layers are used to separate “noisy” patterns from valuable “detection” patterns. Furthermore, the data layers generated according to the disclosed process provide a function that is similar to a brain function in its ability to recognize a pattern as belonging to a higher level concept.

It should be noted that the disclosed pattern recognition process is particularly advantageous in analysis of big-data. Big-data typically refers to a collection of data sets that are so large and complex that they cannot be analyzed using on-hand database management tools or traditional data processing applications, such as those discussed in the related art. As noted above, the disclosed process results in a pattern-space that is large and balanced, thus the pattern-space can be spread evenly between multiple machines, where each machine handles a sub-range of the pattern-space. Therefore, the disclosed process can be efficiently utilized for big data analysis.

Following are two non-limiting examples for the operation of the process for generating the data layers. In the first non-limiting example, shown in FIG. 1, an original set comprising a sequence of 500 symbols is presented, where there are four (4) different symbols: “R”, “G”, “B” and “Y”. Applying the process described herein, that is, identifying symbols, patterns, or sequences, and applying a threshold value to determine which sequences of symbols are to be replaced by another symbol, results in the table shown in FIG. 2. In this case, the number of appearances in the input sequence is determined for each symbol sequence that is a combination of either two or three symbols. It should be noted that all possibilities of sequences are considered, although not all the sequences are shown in FIG. 2. The longest sequence is the data itself; it appears only one time and is below the required repeated threshold.

According to an exemplary embodiment, the first level table shown in FIG. 2 contains only sequences that appear above a first threshold T₁, for example, a threshold value equal to or greater than 10. From those sequences that are above the value T₁, only those having a longer sequence, if contained within the sequence shown in the table, are to be used for symbol replacement. For example, the sequences “BYY” and “YY” are dependent; however, the longer sequence is preferred over the shorter sequence. Therefore, as depicted in FIG. 2, while the sequence “YY” appears 28 times in the input sequence, it appears only 8 times independently, whereas the sequence “BYY” appears 13 times independently. With a threshold determined to be equal to or greater than 10, the sequence “YY” is not replaced by a substitute symbol, while the sequence “BYY” is replaced by the symbol “A”. The resultant sequence after this first iteration of the data layers generation process is shown in FIG. 3. The sequence presented in FIG. 3 shows an increase in the number of symbols in the symbol-space from 4 symbols to 17 symbols (A, C, D, E, F, H, I, J, K, L, M, N, O, Y, R, G, B) and a corresponding reduction in the number of symbols in the sequence, from 500 symbols in the initial sequence to 283 symbols in the subsequent sequence.
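By way of a non-limiting illustration, the total and “independent” appearance counts discussed above may be sketched as follows. The source does not define precisely how independent appearances are counted; this sketch assumes that an appearance of a shorter sequence is independent when it does not lie entirely inside an appearance of the longer sequence. The function names and the short sample sequence are hypothetical and are not the sequence of FIG. 1.

```python
from collections import Counter


def count_ngrams(seq, sizes=(2, 3)):
    """Count every contiguous sequence of the given lengths (overlaps included)."""
    counts = Counter()
    for n in sizes:
        for i in range(len(seq) - n + 1):
            counts[seq[i:i + n]] += 1
    return counts


def independent_count(seq, short, longer):
    """Appearances of `short` not lying wholly inside an appearance of `longer`."""
    starts = [i for i in range(len(seq) - len(longer) + 1)
              if seq.startswith(longer, i)]
    total = 0
    for i in range(len(seq) - len(short) + 1):
        if not seq.startswith(short, i):
            continue
        inside = any(s <= i and i + len(short) <= s + len(longer) for s in starts)
        if not inside:
            total += 1
    return total


sample = "BYYRYYBYYG"                 # hypothetical sample, not FIG. 1
counts = count_ngrams(sample)
yy_independent = independent_count(sample, "YY", "BYY")
```

In the sample, “YY” appears three times in total but only once outside of “BYY”, which mirrors the distinction between the 28 total and 8 independent appearances in the example above.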

The process can now continue with performing another iteration to further reduce the number of symbols in the sequence by expanding the symbol space. For the next iteration, the input sequence (shown in FIG. 3) comprises the reduced symbol sequence of 283 symbols. FIG. 4 shows a second level table that is produced using threshold values equal to or greater than 5. As a result, certain sequences of symbols are each replaced by a corresponding single symbol, thereby reducing the number of symbols in the output sequence to 262 symbols with a symbol space of 20 (A, C, D, E, F, H, I, J, K, L, M, N, O, Y, R, G, B, P, S, T). The resulting output sequence of the second iteration is shown in FIG. 5.

Yet another iteration is performed by the disclosed process using a threshold value equal to or greater than 3, as shown in the table of FIG. 6, and the resultant reduced sequence of symbols is shown in FIG. 7. As can be noticed from the symbols listed in the “Replace Symbol” column in FIG. 6, the symbol-space is increased to 37 symbols. The output symbol sequence (FIG. 7) is reduced to a length of 221 symbols, i.e., less than half of the original length of 500 symbols. It should be noted that each set of sequences generated at each iteration (as shown in FIGS. 3, 5, and 7) is referred to as a data layer or a Cortex layer (a data layer of the Cortex).

Therefore, according to the disclosed embodiments, with respect to the creation of data layers for the example above, it is understood that at the entry data layer, there is a sequence of 500 symbols, using a symbol-space of 4. In the second data layer, after the first data layer processing, there is a sequence of symbols containing 283 symbols, using a symbol-space of 17. In a third data layer, after the second data layer processing, there is a sequence of symbols containing 262 symbols, using a symbol-space of 20. Lastly, in the fourth data layer, after the third data layer processing, there is a sequence of symbols containing 221 symbols, using a symbol-space of 37.
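By way of a non-limiting illustration, the progression of the four data layers can be checked with simple arithmetic. The lengths and symbol-space sizes below are taken from the example above; the function name `is_valid_progression` is hypothetical.

```python
# (sequence length, symbol-space size) for each data layer in the example
layers = [(500, 4), (283, 17), (262, 20), (221, 37)]


def is_valid_progression(layers):
    """Each layer must be shorter and use more symbols than the previous one."""
    return all(
        nxt_len < cur_len and nxt_space > cur_space
        for (cur_len, cur_space), (nxt_len, nxt_space) in zip(layers, layers[1:])
    )


# Length of each layer relative to the 500-symbol input.
relative_lengths = [length / layers[0][0] for length, _ in layers]
```

The relative lengths are 1.0, 0.566, 0.524 and 0.442, so the fourth data layer is less than half the length of the input while the symbol-space grows from 4 to 37.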

In one embodiment, symbols may be replaced by signatures, such as those described in U.S. Pat. Nos. 8,112,376, 8,266,185, 8,312,031, 8,326,775, 8,655,801, and 8,386,400, all assigned to the common assignee, which are hereby incorporated by reference for all that they contain.

In a second non-limiting example for the operation of the disclosed data layers generation process, four image symbols, a line 810, a square 820, a circle 830 and a triangle 840, are shown in FIGS. 8A through 8D respectively, and used according to an embodiment. Combinations of the basic image symbols 810, 820, 830 and 840 may result in various higher level image symbols, such as a house 910 or a chair 920, shown in FIGS. 9A and 9B respectively, and used according to an embodiment. The image symbol of a house 910 is comprised of a square 820-1 and a triangle 840-1 combined in a specific way, recognized as a symbol image of a “house”. Similarly, the image symbol of a chair 920 is comprised of four symbols of lines 810-1, 810-2, 810-3, and 810-4 combined in a specific way, recognized as a symbol image of a “chair”.

According to one embodiment, any one of the basic four image symbols 810, 820, 830 and 840 is connectable to another basic image symbol 810, 820, 830 or 840 at a connecting port. An exemplary and non-limiting designation of connection ports, each port numbered to differentiate it from another port, is shown in FIGS. 10A through 10D respectively. For example, but not by way of limitation, the line 1010 has three ports numbered 1, 2, and 3, while the square 1020 has eight ports numbered 1, 2, 3, 4, 5, 6, 7, and 8, and so on.

It should be understood that the number of connection ports assigned to each basic image symbol 1010, 1020, 1030, and 1040 is merely an example and each image symbol may be comprised of fewer or more connection ports. Each image symbol is further designated, for example, by an identification character, for example, the line has the character “A”, the square, “B”, the circle, “C”, and the triangle “D”. The upper level image of a “house” shown in FIG. 9A could therefore be compactly described as:

    D(4)<0°>B(2)

This means that the image symbol “D” connects to the image symbol “B” at ports “4” and “2” respectively, and at a relative orientation of 0°. Similarly, the upper level image of a “chair” shown in FIG. 9B could be compactly described using the following notation:

    A(3)<[0°>A(1),90°>A(1),(3)<90°>A(1)]

This means that an image symbol “A” is connected through port 3 to port 1 of another image symbol “A” with a relative orientation of 0°, and to port 1 of another image symbol “A” with a relative orientation of 90°, which in turn is connected through its port 3 to port 1 of another image symbol “A” with a relative orientation of 90°.

According to one embodiment, pattern identification and extraction is thereby possible as a result of the data layers (Cortex). FIGS. 11A, 11B and 11C depict three upper level symbols 1110 of a “man”, 1120 of a “woman” and 1130 of a “dog”, each comprised of the basic image symbols shown in FIGS. 10A through 10D. Therefore, using the notation described above, the symbol of a “man” 1110 can be described as:

    C(6)<90°>A(1),(2)<0°>A(2)

The symbol of a “woman” 1120 can be described as:

    C(6)<90°>A(1),(3)<0°>D(1)

And, the symbol of a “dog” 1130 can be described as:

    C(6)<90°>A(1),(2)<0°>A(1),(3)<90°>A(2)

According to one embodiment, a common pattern is extracted, comprising a basic symbol of a circle “C” connecting via a connection port ‘6’ to a symbol of a line “A” at port ‘1’ in a relative orientation of 90°. Hence, the extracted common pattern can be described as:

    C(6)<90°>A(1)

Then, the identified pattern receives a symbol within the data layer in which it was found. For example, the symbol Ω replaces the extracted common pattern C(6)<90°>A(1). Therefore, the symbol of a “man” 1110 could be described in the current data layer as:

    Ω(2)<0°>A(2)

The symbol of a “woman” 1120 could be described in the current data layer as:

    Ω(3)<0°>D(1)

And, the symbol of a “dog” 1130 can be described in the current data layer as:

    Ω(2)<0°>A(1),(3)<90°>A(2)
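By way of a non-limiting illustration, the extraction of the common pattern and its replacement by the symbol Ω may be sketched as follows. This sketch assumes that a description can be split into connection terms at each comma, which holds for the three descriptions above but not for bracketed notation such as that of the “chair”. The function names are hypothetical.

```python
def common_leading_terms(descriptions):
    """Return the connection terms shared at the start of every description."""
    split = [d.split(",") for d in descriptions]
    common = []
    for terms in zip(*split):
        if len(set(terms)) != 1:
            break
        common.append(terms[0])
    return common


def replace_leading(description, common, symbol):
    """Replace the shared leading terms of a description with a single symbol."""
    terms = description.split(",")
    if terms[:len(common)] != common:
        return description            # pattern absent: leave unchanged
    return symbol + ",".join(terms[len(common):])


symbols = {
    "man": "C(6)<90°>A(1),(2)<0°>A(2)",
    "woman": "C(6)<90°>A(1),(3)<0°>D(1)",
    "dog": "C(6)<90°>A(1),(2)<0°>A(1),(3)<90°>A(2)",
}
common = common_leading_terms(list(symbols.values()))
compact = {name: replace_leading(d, common, "Ω") for name, d in symbols.items()}
```

Here `common` contains the single term C(6)<90°>A(1), and `compact` holds the shorter descriptions listed above for the “man”, the “woman” and the “dog”.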

Therefore, using the disclosed process, the number of symbols has increased in this data layer. However, the data set itself is shorter. In one embodiment, a data layer comprises at least the collection of symbols used in an immediate previous data layer. Furthermore, in the above example, C(6)<90°>A(1) is a common pattern. This means that the probability of the combination C(6)<90°>A(1) is larger than a first threshold T₁. Thus, a new label Ω is added to S_(k+1), hence increasing the space by one. The probability of each element in the combination, C and A, is now larger than a second threshold T₂, thus the respective “original labels” (C and A) are removed from S_(k+1). Therefore, as can be understood, the thresholds utilized in the disclosed process are based on certain probabilities that an element will be found in the subsequent data layer.

FIG. 12 shows an exemplary and non-limiting flowchart 1200 that depicts the creation of a data layer responsive of an input sequence of input symbols according to one embodiment. In S1210, an input including a sequence of symbols is received. The symbols may be characters, images, sounds, video and other input sequences, including representations of signals, and the like.

In one embodiment, the sequence includes a set of signatures generated for multimedia content elements. Such signatures are generated as discussed, for example, in the above-referenced U.S. Pat. Nos. 8,112,376, 8,266,185, 8,312,031, 8,655,801, and 8,386,400.

In S1220, all symbol combinations, i.e., two or more symbols, that appear with a frequency (a number of appearances) that is above a predetermined threshold are identified. In S1230, included and derived combinations of symbol combinations identified in S1220 are removed. In one embodiment, this further entails the use of an additional threshold (e.g., threshold T₂ discussed above) to further filter the resultant symbol combinations used. For example, the symbol sequence ‘YYR’ is identified in the input sequence (FIG. 1) as depicted in FIG. 2, but is not included in the resultant data layer.

In S1240, the remaining symbol combinations are each replaced by a unique new symbol. In one embodiment, the remaining symbol combinations are those for which the number of appearances in the input sequence is above the predefined threshold used to filter symbol combinations. In S1250, the resultant sequence of symbols is stored in memory as a data layer that is subsequent to the input data layer.

In S1260, it is checked whether an additional data layer is to be derived from the last generated data layer, and if so, execution continues with S1210, where the new input sequence of symbols is that which was stored in memory in S1250; otherwise, execution terminates.
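By way of a non-limiting illustration, the loop of flowchart 1200 may be sketched as follows. This is a minimal sketch under stated assumptions rather than the disclosed implementation: only combinations of two or three symbols are considered, a shorter combination is dropped whenever it is contained in a retained longer combination (a simplification of S1230 that does not count independent appearances), replacement is greedy from left to right, and new symbols are generated as "#0", "#1", and so on, on the assumption that these do not occur in the input. All function names are hypothetical.

```python
from collections import Counter
from itertools import count


def find_combinations(seq, threshold, sizes=(3, 2)):
    """S1220/S1230: frequent combinations, minus those inside a longer kept one."""
    counts = Counter(
        tuple(seq[i:i + n]) for n in sizes for i in range(len(seq) - n + 1)
    )
    frequent = sorted((c for c, k in counts.items() if k >= threshold),
                      key=len, reverse=True)
    kept = []
    for combo in frequent:
        contained = any(
            combo == longer[i:i + len(combo)]
            for longer in kept if len(longer) > len(combo)
            for i in range(len(longer) - len(combo) + 1)
        )
        if not contained:
            kept.append(combo)
    return kept


def build_layer(seq, threshold, fresh):
    """S1240: replace each remaining combination with a unique new symbol."""
    mapping = {c: next(fresh) for c in find_combinations(seq, threshold)}
    out, i = [], 0
    while i < len(seq):
        for combo, symbol in mapping.items():   # longer combinations come first
            if tuple(seq[i:i + len(combo)]) == combo:
                out.append(symbol)
                i += len(combo)
                break
        else:
            out.append(seq[i])
            i += 1
    return out


def build_layers(seq, thresholds):
    """S1210 to S1260: derive one data layer per threshold while length shrinks."""
    fresh = (f"#{n}" for n in count())          # never-used symbols (assumed)
    layers = [list(seq)]
    for threshold in thresholds:
        out = build_layer(layers[-1], threshold, fresh)
        if len(out) >= len(layers[-1]):         # no further compression
            break
        layers.append(out)                      # S1250: store the data layer
    return layers


layers = build_layers("BYYRBYYGBYYR", thresholds=[3, 2])
```

For the hypothetical 12-symbol input, the first pass replaces “BYY” to give a 6-symbol layer, and the second pass replaces the pair formed by the new symbol followed by “R” to give a 4-symbol layer.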

FIG. 13 shows an exemplary and non-limiting system 1300 for creation of data layers responsive of an input sequence of input symbols according to one embodiment. The system 1300 includes a processing unit (PU) 1310 that may comprise one or more processing elements, such as a computational core. The PU 1310 is communicatively connected to a memory 1320. The memory 1320 may be comprised of both volatile and non-volatile memory, and may further be in proximity to or remote from the PU 1310. The memory 1320 contains instructions in a memory portion 1325 that, when executed by the PU 1310, perform at least the data layers generation process described in detail above, for example, with respect to flowchart 1200.

The sequence of input symbols may be provided from an external source via the input/output interface 1330 that is communicatively coupled to the PU 1310, or from the memory 1320. The input sources to generate the data layers include, but are not limited to, sensory sources such as audio, video, touch, smell, text, and so on. Moreover, combinations of different input data sources are also possible.

In one embodiment, the system 1300 also includes a signature generator 1340 that is communicatively connected to the PU 1310 and/or the memory 1320. The signature generator 1340 may generate signatures respective of the data provided through one or more sources connected to the input/output interface 1330. The generated signatures are then processed by the PU 1310 to generate the data layers. An exemplary implementation for the signature generator 1340 and its functionality can be found in at least the above-referenced U.S. Pat. Nos. 8,112,376, 8,266,185, 8,312,031, 8,326,775, 8,655,801, and 8,386,400.

A data layer maintains several properties. A higher-level data layer demonstrates a greater symbol-space, i.e., the space increases as new layers are generated. The data layer also maintains the property that the probability of symbols being closer increases while the correlation between the symbols decreases. Symbols that are close to each other before the layering process are also close after the process is performed.

According to another embodiment, the data layer maintains invariance, that is, two symbols that are complementary maintain an invariant property. For example, if the input data (sequence of symbols) is a face, the generated data layers are invariant with respect to a closed eye or an open eye of the same face. The generation of data layers comprises common patterns, which are combinations of input patterns from different sources. The output of a data layer is a fusion of information from multiple sources represented by a generic set of indices.

According to another embodiment, all the properties of a data layer are important in the generated layer. That is, if, for example, an audio source is too dominant compared to video, the layer suppresses the audio patterns by generating relevant common patterns. Moreover, if two data sources are correlated, the layer generates a de-correlated fused representation.

The various embodiments disclosed herein can be implemented as hardware,firmware, software or any combination thereof. Moreover, the software ispreferably implemented as an application program tangibly embodied on aprogram storage unit or computer readable medium. The applicationprogram may be uploaded to, and executed by, a machine comprising anysuitable architecture. Preferably, the machine is implemented on acomputer platform having hardware such as one or more central processingunits (“CPUs”), a memory, and input/output interfaces. The computerplatform may also include an operating system and microinstruction code.The various processes and functions described herein may be either partof the microinstruction code or part of the application program, or anycombination thereof, which may be executed by a CPU, whether or not suchcomputer or processor is explicitly shown. In addition, various otherperipheral units may be connected to the computer platform such as anadditional data storage unit and a printing unit. Furthermore, anon-transitory computer readable medium is any computer readable mediumexcept for a transitory propagating signal.

All examples and conditional language recited herein are intended for pedagogical purposes to aid the reader in understanding the disclosed embodiment and the concepts contributed by the inventor to furthering the art, and are to be construed as being without limitation to such specifically recited examples and conditions. Moreover, all statements herein reciting principles, aspects, and embodiments, as well as specific examples thereof, are intended to encompass both structural and functional equivalents thereof. Additionally, it is intended that such equivalents include both currently known equivalents as well as equivalents developed in the future, i.e., any elements developed that perform the same function, regardless of structure.

What is claimed is:
 1. A method for symbol-space based pattern compression, comprising: identifying a plurality of combinations of symbols in an input sequence, each identified combination of symbols appearing in the input sequence above a predefined threshold, the input sequence having a first length; generating an output sequence having a second length by replacing each identified combination of symbols with a unique symbol, wherein each unique symbol is not a previously used symbol, wherein the second length is shorter than the first length; and storing the output sequence as a data layer.
 2. The method of claim 1, further comprising: providing the output sequence as a new input sequence for a subsequent generation of a data layer.
 3. The method of claim 1, wherein each combination of symbols having a combination length below a threshold combination length is not identified.
 4. The method of claim 1, further comprising: determining whether a first combination of symbols appears within a second combination of symbols, the first combination of symbols having a shorter combination length than the second combination of symbols, wherein the first combination of symbols is not identified if the first combination of symbols appears within the second combination of symbols.
 5. The method of claim 1, wherein each symbol is any of: a character, an image, an audio signal, a video signal, and a tangible representation of a signal.
 6. The method of claim 1, wherein the input sequence and the plurality of combinations of symbols are signatures.
 7. The method of claim 6, further comprising: generating a signature based on an input, wherein the input sequence is the generated signature.
 8. A non-transitory computer readable medium having stored thereon instructions for causing one or more processing units to execute the method according to claim 1.
 9. A system for symbol-space based pattern compression, comprising: a processing unit; and a memory, the memory containing instructions that, when executed by the processing unit, configure the system to: identify a plurality of combinations of symbols in an input sequence, each identified combination of symbols appearing in the input sequence above a predefined threshold, the input sequence having a first length; generate an output sequence having a second length by replacing each identified combination of symbols with a unique symbol, wherein each unique symbol is not a previously used symbol, wherein the second length is shorter than the first length; and store the output sequence as a data layer.
 10. The system of claim 9, wherein the system is further configured to: provide the output sequence as a new input sequence for a subsequent generation of a data layer.
 11. The system of claim 9, wherein each combination of symbols having a combination length below a threshold combination length is not identified.
 12. The system of claim 9, wherein the system is further configured to: determine whether a first combination of symbols appears within a second combination of symbols, the first combination of symbols having a shorter combination length than the second combination of symbols, wherein the first combination of symbols is not identified if the first combination of symbols appears within the second combination of symbols.
 13. The system of claim 9, wherein each symbol is any of: a character, an image, an audio signal, a video signal, and a tangible representation of a signal.
 14. The system of claim 9, wherein the input sequence and the plurality of combinations of symbols are signatures.
 15. The system of claim 14, wherein the system is further configured to: generate a signature based on an input, wherein the input sequence is the generated signature.